The objective of the delivery is to perform an analysis of the electoral data, carrying out the debugging, summaries and graphs you consider, both of their results and the accuracy of the electoral polls.
Specifically, you must work only in the time window that includes the elections from 2008 to the last elections of 2019.
General comments
In addition to what you see fit to execute, the following items are mandatory:
Each group should present before 9th January (23:59) an analysis of the data in .qmd and .html format in Quarto slides mode, which will be the ones they will present on the day of the presentation.
Quarto slides should be uploaded to Github (the link should be provided by a member of each group).
The maximum number of slides should be 40. The maximum time for each group will be 20-22 minutes (+5 minutes for questions).
During the presentation you will explain (summarised!) the analysis performed so that each team member speaks for a similar amount of time and each member can be asked about any of the steps. The grade does not have to be the same for all members.
It will be valued not only the content but also the container (aesthetics).
The objective is to demonstrate that the maximum knowledge of the course has been acquired: the more content of the syllabus is included, the better.
Mandatory items:
Data should be converted to tidydata where appropriate.
You should include at least one join between tables.
Reminder: information = variance, so remove columns that are not going to contribute anything.
The glue and lubridate packages should be used at some point, as well as the forcats. The use of ggplot2 will be highly valued.
The following should be used at least once:
mutate
summarise
group_by (or equivalent)
case_when
We have many, many parties running for election. We will only be interested in the following parties:
PARTIDO SOCIALISTA OBRERO ESPAÑOL (beware: it has/had federations - branches - with some other name).
PARTIDO POPULAR
CIUDADANOS (caution: has/had federations - branches - with some other name)
PARTIDO NACIONALISTA VASCO
BLOQUE NACIONALISTA GALLEGO
CONVERGÈNCIA I UNIÓ
UNIDAS PODEMOS - IU (beware that here they have had various names - IU, podem, ezker batua, …- and have not always gone together, but here we will analyze them together)
ESQUERRA REPUBLICANA DE CATALUNYA
EH - BILDU (are now a coalition of parties formed by Sortu, Eusko Alkartasuna, Aralar, Alternatiba)
MÁS PAÍS
VOX
Anything other than any of the above parties should be imputed as “OTHER”. Remember to add properly the data after the previous recoding.
Party acronyms will be used for the visualizations. The inclusion of graphics will be highly valued (see https://r-graph-gallery.com/).
You must use all 4 data files at some point.
You must define at least one (non-trivial) function of your own.
You will have to discard mandatory polls that:
- refer to elections before 2008
- that are exit polls
- have a sample size of less than 750 or are unknown
- that have less than 1 or less fieldwork days
You must obligatorily answer the following questions (plus those that you consider analyzing to distinguish yourself from the rest of the teams, either numerically and/or graphically)
- Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?
- Which party was the second when the first was the PSOE? And when the first was the PP?
- Who benefits from low turnout?
- How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
- How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
- Which polling houses got it right the most and which ones deviated the most from the results?
You should include at least 3 more “original” questions that you think that it could be interesting to be answer with the data.
Marks
The one who does the most things will not be valued the most. More is not always better. The originality (with respect to the rest of the works, for example in the analyzed or in the subject or …) of what has been proposed, in the handling of tables (or in the visualization), the caring put in the delivery (care in life is important) and the relevance of what has been done will be valued. Once you have the mandatory items with your database more or less completed, think before chopping code: what could be interesting? What do I need to get a summary both numerical and visual?
Remember that the real goal is to demonstrate a mastery of the tools seen throughout the course. And that happens not only by the quantity of them used but also by the quality when executing them.
Some dataviz will be extremely positive valued.
Required packages
Insert in the lower chunk the packages you will need
rm(list =ls())library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
The practice will be based on the electoral data archives below, compiling data on elections to the Spanish Congress of Deputies from 2008 to the present, as well as surveys, municipalities codes and abbreviations.
# NO TOQUES NADAelection_data <-read_csv(file ="./data/datos_elecciones_brutos.csv")
Rows: 48737 Columns: 471
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): tipo_eleccion, mes, codigo_ccaa, codigo_provincia, codigo_municipio
dbl (424): anno, vuelta, codigo_distrito_electoral, numero_mesas, censo, par...
lgl (42): FALANGE ESPAÑOLA DE LA JONS, PARTIDO COMUNISTA DEL PUEBLO CASTELL...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
cod_mun <-read_csv(file ="./data/cod_mun.csv")
Rows: 8135 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): cod_mun, municipio
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 3753 Columns: 59
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): type_survey, id_pollster, pollster, media
dbl (51): size, turnout, UCD, PSOE, PCE, AP, CIU, PA, EAJ-PNV, HB, ERC, EE,...
lgl (1): exit_poll
date (3): date_elec, field_date_from, field_date_to
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
abbrev <-read_csv(file ="./data/siglas.csv")
Rows: 587 Columns: 2
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): denominacion, siglas
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
The data will be as follows:
election_data: file with election data for Congress from 2018 to the last ones in 2019.
tipo_eleccion: type of election (02 if congressional election)
anno, mes: year and month of elections
vuelta: electoral round (1 if first round)
codigo_ccaa, codigo_provincia, codigo_municipio, codigo_distrito_electoral: code of the ccaa, province, municipality and electoral district.
numero_mesas: number of polling stations
censo: census
participacion_1, participacion_2: participation in the first preview (14:00) and second preview (18:00) before polls close (20:00)
votos_blancos: blank ballots
votos_candidaturas: party ballots
votos_nulos: null ballots
ballots for each party
cod_mun: file with the codes and names of each municipality
abbrev: acronyms and names associated with each party
surveys: table of electoral polls since 1982. Some of the variables are the following:
type_survey: type of survey (national, regional, etc.)
date_elec: date of future elections
id_pollster, pollster, media: id and name of the polling company, as well as the media that commissioned it.
field_date_from, field_date_to: start and end date of fieldwork
exit_poll: whether it is an exit poll or not
size: sample size
turnout: turnout estimate
estimated voting intentions for the main parties
Cleaning the data – surveys
# Filter datasetcleaned_surveys <- surveys |>mutate(# Parse dates variables as date objectsfield_date_from =ymd(field_date_from),field_date_to =ymd(field_date_to),date_elec =ymd(date_elec),# Calculate the number of fieldwork daysfieldwork_days =as.numeric(field_date_to - field_date_from +1) ) |>filter(!exit_poll, # Exclude exit polls date_elec >=ymd("2008-01-01"), # Exclude polls referred to elections before 2008 size >=750, # Exclude polls with sample size < 750 fieldwork_days >1# Exclude polls with 1 or fewer fieldwork days )# Deleting columns that only have NAscleaned_surveys <- cleaned_surveys |>select(where(~!all(is.na(.))))# Identify party columns dynamicallymetadata_columns <-c("type_survey", "date_elec", "id_pollster", "pollster", "media","field_date_from", "field_date_to", "fieldwork_days", "exit_poll", "size", "turnout")party_columns <-setdiff(colnames(cleaned_surveys), metadata_columns)# Reshape data into long formattidy_surveys <- cleaned_surveys |>pivot_longer(cols =all_of(party_columns), # Reshape party columnsnames_to ="party_raw", # Raw party namesvalues_to ="votes"# Corresponding voting intentions )# add on party names by codetidy_surveys <- tidy_surveys %>%mutate(party =case_when( party_raw =="PSOE"~"PARTIDO SOCIALISTA OBRERO ESPAÑOL", party_raw =="CIU"~"CONVERGÈNCIA I UNIÓ", party_raw =="EAJ-PNV"~"PARTIDO NACIONALISTA VASCO", party_raw =="ERC"~"ESQUERRA REPUBLICANA DE CATALUNYA", party_raw =="IU"~"UNIDAS PODEMOS - IU", party_raw =="PP"~"PARTIDO POPULAR", party_raw =="BNG"~"BLOQUE NACIONALISTA GALLEGO", party_raw =="CS"~"CIUDADANOS", party_raw =="EH-BILDU"~"EH - BILDU", party_raw =="PODEMOS"~"UNIDAS PODEMOS - IU", party_raw =="VOX"~"VOX", party_raw =="MP"~"MÁS PAÍS",TRUE~"OTHER") )# Create a column for proper, unqique acronymstidy_surveys <- tidy_surveys |>mutate(party_code =case_when( party =="UNIDAS PODEMOS - IU"~"PODEMOS-IU", party =="OTHER"~"OTHER",TRUE~ party_raw) )# Select relevant columns# Getting rid of type_survey, exit_poll (take only 1 value), party_rawfinal_surveys <- tidy_surveys |>select(-type_survey, -exit_poll, -party_raw) |>relocate(fieldwork_days, .after = field_date_to) |>relocate(votes, .after = party_code) # Summing all votes based on the party reclassificationfinal_surveys <- final_surveys |>group_by(across(-votes)) |>summarize(votes =sum(votes, na.rm =TRUE), .groups ="drop") |>arrange(field_date_from)# We have 1614 surveys (rows from cleaned_surveys), 12 parties (meaning 12 rows per survey). Thus 1614x12=19368 rows# Previewfinal_surveys
The election_data file is large and requires quite extensive cleaning to make it “tidy”. We will tidy the data to try make it most useful for future analysis. The election data starts off with 48,737 rows and 471 columns. Reducing the number of columns is a clear priority.
First, we look at the quality of the data and see if any information is redundant and can be removed.
plot_intro(election_data)
# We see 1.9% missing colums, identify the cols with no data - we have 9 cols. blank_cols <-names(election_data)[sapply(election_data, function(x) all(is.na(x)))]# Drop these columns and also filter to ensure no info outside 2008 to 2019 is included. election_data <- election_data |>select(-all_of(blank_cols)) |>filter(anno >=2008& anno <=2019)# Drop columns that are logicalelection_data <- election_data %>%select(where(~!is.logical(.)))
# See the improvementsplot_intro(election_data)
Second, we begin to make the election data tidy. We start by pivoting the data so all columns for party names are within one “party” variable. Before this we have 414 columns referring to parties.
# Pivot all the party names and ballot counts to the main tableelection_pivot <- election_data |>pivot_longer(cols =`BERDEAK-LOS VERDES`:`COALICIÓN POR MELILLA`, # select all party datanames_to ="party",values_to ="ballots" )str(election_pivot)
We now have a table with 20,177,118 rows and 17 columns.
This is more clean than previously, but we still need to aggregate of our party variables into the main party groups. We will do this by creating a mapping table (party_names) that standardizes the raw party names into main party groupings (party_main) using regular expressions.
party_names <-tibble(names =unique(election_pivot$party))# Party names in the election_data file do not match up perfectly with the abbrev file (i.e. some of the names present in party_names are not in abbrev)# So it is better to work directly on party_names instead of using abbrevparty_names <- party_names |>mutate(party_main =case_when(str_detect(names, "(?i)PSOE|PARTIDO DOS SOCIALISTAS DE GALICIA|PARTIDO SOCIALISTA DE EUSKADI|PARTIDO SOCIALISTA OBRERO ESPAÑOL|PARTIT SOCIALISTA OBRER ESPANYOL") ~"PARTIDO SOCIALISTA OBRERO ESPAÑOL",str_detect(names, "(?i)PARTIDO POPULAR") ~"PARTIDO POPULAR",str_detect(names, "(?i)CIUDADANOS-PARTIDO DE LA CIUDADANIA|CIUDADANOS-PARTIDO DE LA CIUDADANÍA|CIUDADANOS PARTIDO DE LA CIUDADANIA|CIUDADANOS PARTIDO DE LA CIUDADANÍA|CIUDADANOS, PARTIDO DE LA CIUDADANÍA|CIUTADANS") ~"CIUDADANOS",str_detect(names, "(?i)EUZKO ALDERDI JELTZALEA-PARTIDO NACIONALISTA VASCO") ~"PARTIDO NACIONALISTA VASCO",str_detect(names, "(?i)BLOQUE NACIONALISTA GALEGO|BNG") ~"BLOQUE NACIONALISTA GALLEGO",str_detect(names, "(?i)CONVERGENCIA I UNIO|CONVERGÈNCIA I UNIÓ") ~"CONVERGÈNCIA I UNIÓ",str_detect(names, "(?i)PODEM|EZKER BATUA|EZKER ANITZA|IZQUIERDA UNIDA|ESQUERRA UNIDA|ESQUERDA UNIDA") ~"UNIDAS PODEMOS - IU",str_detect(names, "(?i)ESQUERRA REPUBLICANA") ~"ESQUERRA REPUBLICANA DE CATALUNYA",str_detect(names, "(?i)BILDU|EUSKO ALKARTASUNA|ARALAR|SORTU|ALTERNATIBA") ~"EH - BILDU",str_detect(names, "(?i)MÁS PAÍS") ~"MÁS PAÍS",str_detect(names, "(?i)VOX") ~"VOX",TRUE~"OTHER") )unique(party_names$party_main)
# Adding party code to party_names dataframeparty_names <- party_names |>left_join(party_info, by =join_by(party_main == party))
Now join on the main party names and codes to our election table. Testing was undertaken and the join of a table was more efficient than alternatives (e.g. str_detects over election_pivot or rowwise summaries).
# Join party main and party code into main dfelection_pivot <- election_pivot |>left_join(party_names, by =join_by(party == names))
Now we will include some additional information that will make the analysis potentially easier later, including province and total votes counts from our data:
# Create municipal code to join on municipal names. # Create total votes columnelection_pivot <- election_pivot |>mutate(cod_mun =paste(codigo_ccaa, codigo_provincia, codigo_municipio, sep="-"),total_votes = votos_blancos + votos_nulos + votos_candidaturas)# Join municipality nameselection_pivot <- election_pivot |>left_join(cod_mun, by =join_by(cod_mun)) # Check quality of the join and whether NA's have been introduced as municipality namesany(is.na(election_pivot$municipio))
[1] FALSE
Be careful not all 8135 municipalities appear in each election. We have 6 elections and 414 parties, thus we should have 6x414=2484 occurrences for each municipality, but that is not the case.
Also be careful some municipalities have the same name (but different mun_code), so if you ever need to group by municipality remember to group by mun_code instead of municipality.
# Count the number of times each municipaly appears and then get the unique values for that count (not all are 2484) meaning some municipalities are not present in certain electionselection_pivot |>count(cod_mun) |>pull(n) |>unique()
[1] 2484 1656 828 1242 2070
#Number of unique values for cod_mun is different than number of unique values for municipion_distinct(cod_mun$cod_mun)
[1] 8135
n_distinct(cod_mun$municipio)
[1] 8118
Now we need to group together all of the votes for “OTHER” variables and create unique identifiers for each individual election in our dataframes.
Currently we have a table of 22 variables with 20,177,118 rows. We can clean this more.
First, identify the redundant data in our election. We can remove:
tipo_eleccion - because all values = 02. It is not useful vuelta = because all values = 1, it is not useful. geographic variables = we will remove codigo_municipio is included in cod_mun which we joined on from the cod_mun table. We keep the autonomous community and proivnce variables for potential future aggregation and analysis. codigo_distrito_electoral - because every value is zero. It is not useful.
Notably, we have many NA ballot rows and a row for each individual party at each election, where will also try to reduce this when we aggregate the party data with the “party_main” variable created.
# To clean the data more, reduce our dataset and rename key variables so everything is more consistent in Englishtidy_election <- election_pivot |>select(year = anno, month = mes,code_community = codigo_ccaa,code_province = codigo_provincia,code_municipality = cod_mun,municipality = municipio,population = censo,polling_stations = numero_mesas,participation_1 = participacion_1,participation_2 = participacion_2,blank_votes = votos_blancos,null_votes = votos_nulos,valid_votes = votos_candidaturas, total_votes, party_main, party_code, ballots)summary(tidy_election)
year month code_community code_province
Min. :2008 Length:20177118 Length:20177118 Length:20177118
1st Qu.:2011 Class :character Class :character Class :character
Median :2016 Mode :character Mode :character Mode :character
Mean :2015
3rd Qu.:2019
Max. :2019
code_municipality municipality population polling_stations
Length:20177118 Length:20177118 Min. : 3 Min. : 1.000
Class :character Class :character 1st Qu.: 144 1st Qu.: 1.000
Mode :character Mode :character Median : 454 Median : 1.000
Mean : 4249 Mean : 7.261
3rd Qu.: 1858 3rd Qu.: 3.000
Max. :2384269 Max. :3742.000
participation_1 participation_2 blank_votes null_votes
Min. : 0 Min. : 0 Min. : 0.00 Min. : 0.00
1st Qu.: 57 1st Qu.: 86 1st Qu.: 1.00 1st Qu.: 1.00
Median : 185 Median : 278 Median : 3.00 Median : 4.00
Mean : 1640 Mean : 2448 Mean : 28.71 Mean : 29.84
3rd Qu.: 720 3rd Qu.: 1109 3rd Qu.: 12.00 3rd Qu.: 16.00
Max. :1022073 Max. :1531231 Max. :17409.00 Max. :16527.00
valid_votes total_votes party_main party_code
Min. : 1 Min. : 2 Length:20177118 Length:20177118
1st Qu.: 106 1st Qu.: 109 Class :character Class :character
Median : 336 Median : 343 Mode :character Mode :character
Mean : 3025 Mean : 3084
3rd Qu.: 1364 3rd Qu.: 1393
Max. :1847096 Max. :1872679
ballots
Min. : 1
1st Qu.: 3
Median : 15
Mean : 372
3rd Qu.: 93
Max. :919701
NA's :19781159
Area information -> code_community (autonomous community), code_province, code_municipality, municipality, population
General election information -> polling_stations, participation_1, participation_2, blank_votes, null_votes, valid_votes, total_votes
Party votes received -> party_main, party_code, party_ballots
Creating turnout dataframe
Creating a dataframe storing all the turnout data for each municipality in each election in case we need to work just on turnout or other data that does not change by party.
We have 2 primary datasets at this stage, election data and survey data, plus a turnout dataframe which is a subset of the election data. For surveys, the data has been cleaned so each row represents the votes for one party within a specific national poll. For elections, the data has been cleaned so each row represents the number of votes for a party within an election in a specific municipality.
The final_surveys data includes:
election date, pollster and media information, fieldwork dates
size of the survey and turnout
party name, party code
votes received (for that party in that poll)
The final_election data includes:
date of the election
party name, party code (with non-primary parties grouped)
identifier for autonomous community, province and municipality
municipality population
election information such as number of polling stations, votes per session
ballots received (for that party per election in each municipality)
The turnout data includes:
information on the number of votes and type of vote (e.g. valid or blank/null) per municipality in each election.
!!!!!!! WORK ON DATAFRAMES final_surveys, final_election, turnout !!!!!!! !!!!!!! DO NOT OVERWRITE THESE DATAFRAMES, CREATE NEW ONES IF YOU NEED TO MODIFY THEM (ex. surveys_q1 <- final_surveys) !!!!!!!
Mandatory questions
1.Which party was the winner in the municipalities with more than 100,000 habitants (census) in each of the elections?
2. Which party was the second when the first was the PSOE? And when the first was the PP?
3. Who benefits from low turnout?
4. How to analyze the relationship between census and vote? Is it true that certain parties win in rural areas?
5. How to calibrate the error of the polls (remember that the polls are voting intentions at national level)?
6. Which polling houses got it right the most and which ones deviated the most from the results?
Additional questions
Confirm with the group your original analysis question to avoid clashing ideas :)
SOME IDEAS FOR THE ORIGINAL QUESTIONS TO START?
Which regions had the most predictable votes (i.e. consistently voted for the same party) and which regions were the most undecided (i.e. had the most variance in there votes across) between the 2008 and 2019 elections?
✅ Map the outcomes over time - plotly on the results? Isabel has some interest but tbc if we will Map.
Can we load in other data? think that would go down well? Maybe predict the next election results based on previous trends of the 5 years and compare to see if the following election followed the trend? Think Javi would be happy with new data.
Which municipalities voting patterns were most consistent with the national trends?
Which 2 media/pollster outlets had the most polarized estimates of each election?
7. Jacklyn and Linghan
8. Yijia and Diego
How has electoral support for smaller parties evolved over time?
The maps show a significant increase in support for smaller parties over time by autonomous community, with darker regions indicating higher percentages. Madrid, for example, rises from around 10% in 2008 to a peak of 54% in April 2019. Northern communities including Catalonia, Navarra, and the Basque Country consistently show high levels of support for small parties.
The April 2019 election marked the height of support for smaller parties overall, but the November 2019 election, a re-vote triggered by the failure to form a majority coalition, shows a slight decline in their support, as reflected in the lighter shading across many regions.